
    The Visual Centrifuge: Model-Free Layered Video Representations

    True video understanding requires making sense of non-Lambertian scenes where the color of light arriving at the camera sensor encodes information about not just the last object it collided with, but about multiple mediums -- colored windows, dirty mirrors, smoke or rain. Layered video representations have the potential of accurately modelling realistic scenes but have so far required stringent assumptions on motion, lighting and shape. Here we propose a learning-based approach for multi-layered video representation: we introduce novel uncertainty-capturing 3D convolutional architectures and train them to separate blended videos. We show that these models then generalize to single videos, where they exhibit interesting abilities: color constancy, factoring out shadows and separating reflections. We present quantitative and qualitative results on real-world videos. Comment: Appears in 2019 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2019). This arXiv version contains the CVPR camera-ready version of the paper (with larger figures) as well as an appendix detailing the model architecture.
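
    The training signal described above (blend two videos, then ask the model to un-mix them) can be sketched compactly. The snippet below is a minimal illustration, not the paper's architecture: `TinySeparator` is a stand-in 3D CNN, the blend is a fixed 50/50 average, and the permutation-invariant reconstruction loss is a common choice for this kind of layer separation rather than the paper's exact objective.

```python
# Sketch of blended-video training: two clips are averaged and a small 3D CNN
# is asked to recover both layers; the loss is permutation-invariant because
# the order of the predicted layers is arbitrary.
import torch
import torch.nn as nn

class TinySeparator(nn.Module):
    """Placeholder 3D CNN: blended clip (B, 3, T, H, W) -> two layers (B, 2, 3, T, H, W)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv3d(16, 6, kernel_size=3, padding=1),  # 2 layers x 3 channels
        )

    def forward(self, x):
        b, _, t, h, w = x.shape
        return self.net(x).view(b, 2, 3, t, h, w)

def blend(v1, v2):
    """Uniform alpha-blend of two clips (synthetic training mixture)."""
    return 0.5 * (v1 + v2)

def permutation_invariant_loss(pred, v1, v2):
    """Min over the two possible assignments of predicted layers to ground-truth clips."""
    l_a = ((pred[:, 0] - v1) ** 2).mean() + ((pred[:, 1] - v2) ** 2).mean()
    l_b = ((pred[:, 0] - v2) ** 2).mean() + ((pred[:, 1] - v1) ** 2).mean()
    return torch.minimum(l_a, l_b)

model = TinySeparator()
v1, v2 = torch.rand(2, 3, 8, 32, 32), torch.rand(2, 3, 8, 32, 32)
loss = permutation_invariant_loss(model(blend(v1, v2)), v1, v2)
loss.backward()
```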

    Apprentissage structuré à partir de vidéos et langage (Structured learning from videos and language)

    The goal of this thesis is to develop models, representations and structured learning algorithms for the automatic understanding of complex human activities from instructional videos narrated with natural language. We first introduce a model that, given a set of narrated instructional videos describing a task, is able to generate a list of action steps needed to complete the task and locate them in the visual and textual streams. To that end, we formulate two assumptions. First, people perform actions when they mention them. Second, we assume that complex tasks are composed of an ordered sequence of action steps. Equipped with these two hypotheses, our model first clusters the textual inputs and then uses this output to refine the location of the action steps in the video. We evaluate our model on a newly collected dataset of instructional videos depicting 5 different complex goal-oriented tasks. We then present an approach to link actions and the manipulated objects. More precisely, we focus on actions that aim at modifying the state of a specific object, such as pouring a cup of coffee or opening a door. Such actions are an inherent part of instructional videos. Our method is based on the optimization of a joint cost between actions and object states under constraints. The constraints reflect our assumption that there is a consistent temporal order for the changes in object states and manipulation actions. We demonstrate experimentally that object states help localize actions and, conversely, that action localization improves object state recognition. All our models are based on discriminative clustering, a technique that allows us to leverage the readily available weak supervision contained in instructional videos. In order to deal with the resulting optimization problems, we take advantage of a highly adapted optimization technique: the Frank-Wolfe algorithm. Motivated by the fact that scaling our approaches to thousands of videos is essential in the context of narrated instructional videos, we also present several improvements to make the Frank-Wolfe algorithm faster and more computationally efficient. In particular, we propose three main modifications to the Block-Coordinate Frank-Wolfe algorithm: gap-based sampling, away and pairwise block Frank-Wolfe steps, and a solution to cache the oracle calls. We show the effectiveness of our improvements on four challenging structured prediction tasks.
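
    Of the Frank-Wolfe improvements listed above, gap-based sampling is the easiest to illustrate in isolation. The sketch below applies Block-Coordinate Frank-Wolfe to a toy least-squares problem over a product of probability simplices and samples blocks in proportion to their last observed Frank-Wolfe gap; the problem, dimensions and step-size rule are assumptions for the example, not taken from the thesis.

```python
# Block-Coordinate Frank-Wolfe with gap-based block sampling on a toy problem:
# minimize f(x) = 0.5 * ||A x - b||^2 with x split into blocks, each block
# constrained to a probability simplex. The linear oracle over a simplex
# returns a vertex (a one-hot vector); blocks with larger gaps are sampled
# more often.
import numpy as np

rng = np.random.default_rng(0)
n_blocks, block_dim = 5, 4
dim = n_blocks * block_dim
A = rng.standard_normal((dim, dim))
b = rng.standard_normal(dim)

def grad(x):
    return A.T @ (A @ x - b)

# start at a feasible point: uniform distribution on every block simplex
x = np.full(dim, 1.0 / block_dim)
gaps = np.full(n_blocks, 1.0)          # optimistic initial gaps

for k in range(200):
    p = gaps / gaps.sum()              # gap-based sampling distribution
    i = rng.choice(n_blocks, p=p)
    sl = slice(i * block_dim, (i + 1) * block_dim)

    g_i = grad(x)[sl]
    s_i = np.zeros(block_dim)
    s_i[np.argmin(g_i)] = 1.0          # linear oracle over the block simplex
    d_i = s_i - x[sl]
    gaps[i] = max(-g_i @ d_i, 1e-12)   # block Frank-Wolfe gap (non-negative)

    gamma = 2.0 / (k + 2.0)            # standard diminishing step size
    x[sl] += gamma * d_i               # stays inside the block simplex

print("objective:", 0.5 * np.linalg.norm(A @ x - b) ** 2)
```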

    Learning to Localize and Align Fine-Grained Actions to Sparse Instructions

    Automatic generation of textual video descriptions that are time-aligned with video content is a long-standing goal in computer vision. The task is challenging due to the difficulty of bridging the semantic gap between the visual and natural language domains. This paper addresses the task of automatically generating an alignment between a set of instructions and a first-person video demonstrating an activity. The sparse descriptions and ambiguity of written instructions create significant alignment challenges. The key to our approach is the use of egocentric cues to generate a concise set of action proposals, which are then matched to recipe steps using object recognition and computational linguistic techniques. We obtain promising results on both the Extended GTEA Gaze+ dataset and the Bristol Egocentric Object Interactions Dataset.
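
    The matching step described above can be illustrated with a toy version: each action proposal is reduced to the set of object labels recognized in it (hypothetical inputs below), each recipe step to its words, and a small dynamic program picks one step per proposal while keeping the assignment in temporal order. The paper's pipeline uses egocentric cues and richer linguistic matching; this only sketches the order-preserving assignment.

```python
# Align action proposals to recipe steps by word overlap, with a dynamic
# program that keeps the step indices non-decreasing over time.
proposals = [{"bowl", "eggs"}, {"eggs", "whisk"}, {"oil", "pan"}, {"eggs", "pan"}]
steps = ["crack eggs into a bowl", "whisk the eggs",
         "heat oil in a pan", "pour eggs into the pan"]

def overlap(objs, step):
    return len(objs & set(step.split()))

# score[i][j] = similarity between proposal i and step j
score = [[overlap(p, s) for s in steps] for p in proposals]

# dp[i][j]: best total score aligning proposals 0..i, with proposal i on step j
n, m = len(proposals), len(steps)
dp = [[0.0] * m for _ in range(n)]
back = [[0] * m for _ in range(n)]
dp[0] = [float(score[0][j]) for j in range(m)]
for i in range(1, n):
    for j in range(m):
        prev = max(range(j + 1), key=lambda q: dp[i - 1][q])
        dp[i][j] = dp[i - 1][prev] + score[i][j]
        back[i][j] = prev

# recover the alignment by back-tracking from the best final step
j = max(range(m), key=lambda q: dp[n - 1][q])
alignment = [j]
for i in range(n - 1, 0, -1):
    j = back[i][j]
    alignment.append(j)
alignment.reverse()
print(alignment)  # -> [0, 1, 2, 3]: one step per proposal, in temporal order
```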

    Unsupervised Learning from Narrated Instruction Videos

    We address the problem of automatically learning the main steps to complete a certain task, such as changing a car tire, from a set of narrated instruction videos. The contributions of this paper are three-fold. First, we develop a new unsupervised learning approach that takes advantage of the complementary nature of the input video and the associated narration. The method solves two clustering problems, one in text and one in video, applied one after the other and linked by joint constraints to obtain a single coherent sequence of steps in both modalities. Second, we collect and annotate a new challenging dataset of real-world instruction videos from the Internet. The dataset contains about 800,000 frames for five different tasks that include complex interactions between people and objects, and are captured in a variety of indoor and outdoor settings. Third, we experimentally demonstrate that the proposed method can automatically discover, in an unsupervised manner, the main steps to achieve the task and locate the steps in the input videos. Comment: Appears in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016). 21 pages.
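
    A rough sketch of the two linked stages, under strong simplifications: narration sentences are clustered into K candidate steps, the clusters are ordered by when they tend to be mentioned, and the sentence timestamps seed the localization of each step in the video (the "people do what they say when they say it" assumption). The example narration, the value of K, and the use of scikit-learn's TF-IDF and KMeans are stand-ins for the paper's discriminative clustering with joint ordering constraints.

```python
# Stage 1: cluster narration sentences into candidate steps.
# Stage 2: order the steps and seed their video locations from the timestamps.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# hypothetical narration from one "change a tire" video: (sentence, time in seconds)
narration = [
    ("jack up the car", 12.0),
    ("loosen the nuts with the wrench", 40.0),
    ("remove the flat tire", 75.0),
    ("put on the spare tire", 110.0),
    ("tighten the nuts again", 150.0),
    ("lower the car back down", 180.0),
]
K = 3

sentences = [s for s, _ in narration]
times = np.array([t for _, t in narration])

X = TfidfVectorizer().fit_transform(sentences)
labels = KMeans(n_clusters=K, n_init=10, random_state=0).fit_predict(X)

# order the discovered steps by when they tend to be mentioned
order = np.argsort([times[labels == k].mean() for k in range(K)])

for rank, k in enumerate(order):
    mentioned_at = times[labels == k]
    print(f"step {rank}: sentences {np.where(labels == k)[0].tolist()}, "
          f"video seed window ~{mentioned_at.min():.0f}-{mentioned_at.max():.0f}s")
```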

    Cross-task weakly supervised learning from instructional videos

    In this paper we investigate learning visual models for the steps of ordinary tasks using weak supervision via instructional narrations and an ordered list of steps instead of strong supervision via temporal annotations. At the heart of our approach is the observation that weakly supervised learning may be easier if a model shares components while learning different steps: `pour egg' should be trained jointly with other tasks involving `pour' and `egg'. We formalize this in a component model for recognizing steps and a weakly supervised learning framework that can learn this model under temporal constraints from narration and the list of steps. Existing data does not permit a systematic study of sharing, so we also gather a new dataset, CrossTask, aimed at assessing cross-task sharing. Our experiments demonstrate that sharing across tasks improves performance, especially when done at the component level, and that our component model can parse previously unseen tasks by virtue of its compositionality. Comment: 18 pages, 17 figures, to be published in the proceedings of CVPR 2019.
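
    The component idea above is easy to sketch: the score of a step such as `pour egg' is the sum of the scores of its components (`pour', `egg'), and those component parameters are shared by every task that mentions them. The dimensions and names below are hypothetical, and the snippet omits the temporal constraints and narration used for weak supervision.

```python
# Component model sketch: one linear scorer per component, shared across all
# tasks; a step's score is the sum of its components' scores on a frame feature.
import torch
import torch.nn as nn

class ComponentStepScorer(nn.Module):
    def __init__(self, components, feat_dim):
        super().__init__()
        self.index = {c: i for i, c in enumerate(components)}
        # one weight vector per component, shared across every task and step
        self.w = nn.Parameter(torch.randn(len(components), feat_dim) * 0.01)

    def score_step(self, feat, step_components):
        """feat: (T, feat_dim) per-frame features; returns (T,) step scores."""
        idx = torch.tensor([self.index[c] for c in step_components])
        return feat @ self.w[idx].sum(dim=0)

components = ["pour", "whisk", "egg", "milk", "pan"]
model = ComponentStepScorer(components, feat_dim=64)
frames = torch.randn(10, 64)

# "pour egg" and "pour milk" share the "pour" parameters
print(model.score_step(frames, ["pour", "egg"]).shape)   # torch.Size([10])
print(model.score_step(frames, ["pour", "milk"]).shape)  # torch.Size([10])
```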

    Controllable Attention for Structured Layered Video Decomposition

    The objective of this paper is to be able to separate a video into its natural layers, and to control which of the separated layers to attend to. For example, to be able to separate reflections, transparency or object motion. We make the following three contributions: (i) we introduce a new structured neural network architecture that explicitly incorporates layers (as spatial masks) into its design. This improves separation performance over previous general purpose networks for this task; (ii) we demonstrate that we can augment the architecture to leverage external cues such as audio for controllability and to help disambiguation; and (iii) we experimentally demonstrate the effectiveness of our approach and training procedure with controlled experiments while also showing that the proposed model can be successfully applied to real-world applications such as reflection removal and action recognition in cluttered scenes. Comment: In ICCV 2019.
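
    The "layers as spatial masks" design mentioned in contribution (i) can be sketched as follows: a toy 2D backbone predicts, per pixel, a soft assignment over L layers plus an RGB image per layer, and the input must be explained as the mask-weighted sum of the layers. The layer count, backbone and loss are assumptions for illustration; the paper's model operates on video and can additionally be conditioned on cues such as audio.

```python
# Layered decomposition sketch: per-pixel soft masks over L layers plus per-layer
# RGB, trained so that the masked layers re-compose the input frame.
import torch
import torch.nn as nn

L = 2  # e.g. scene layer + reflection layer

class MaskedLayerNet(nn.Module):
    def __init__(self, layers=L):
        super().__init__()
        self.layers = layers
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, layers * 4, 3, padding=1),  # per layer: 3 RGB + 1 mask logit
        )

    def forward(self, x):
        b, _, h, w = x.shape
        out = self.backbone(x).view(b, self.layers, 4, h, w)
        rgb = torch.sigmoid(out[:, :, :3])             # (B, L, 3, H, W)
        masks = torch.softmax(out[:, :, 3], dim=1)     # (B, L, H, W), sums to 1 over layers
        recon = (masks.unsqueeze(2) * rgb).sum(dim=1)  # (B, 3, H, W)
        return rgb, masks, recon

net = MaskedLayerNet()
frame = torch.rand(1, 3, 64, 64)
rgb, masks, recon = net(frame)
loss = ((recon - frame) ** 2).mean()   # the layers must explain the input
loss.backward()
```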

    Multi-Task Learning of Object State Changes from Uncurated Videos

    We aim to learn to temporally localize object state changes and the corresponding state-modifying actions by observing people interacting with objects in long uncurated web videos. We introduce three principal contributions. First, we explore alternative multi-task network architectures and identify a model that enables efficient joint learning of multiple object states and actions such as pouring water and pouring coffee. Second, we design a multi-task self-supervised learning procedure that exploits different types of constraints between objects and state-modifying actions, enabling end-to-end training of a model for temporal localization of object states and actions in videos from only noisy video-level supervision. Third, we report results on the large-scale ChangeIt and COIN datasets containing tens of thousands of long (un)curated web videos depicting various interactions such as hole drilling, cream whisking, or paper plane folding. We show that our multi-task model achieves a relative improvement of 40% over the prior single-task methods and significantly outperforms both image-based and video-based zero-shot models for this problem. We also test our method on long egocentric videos of the EPIC-KITCHENS and the Ego4D datasets in a zero-shot setup, demonstrating the robustness of our learned model.
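
    One of the constraints exploited above, that the initial state precedes the state-modifying action, which in turn precedes the end state, can be illustrated with a small search: given per-frame scores for the three categories, pick frames t1 < t2 < t3 maximizing the total score. The random scores below are stand-ins for the outputs of a learned multi-task model, which the paper would then use as pseudo-labels.

```python
# Causal ordering constraint sketch: find the best (initial state, action,
# end state) frame triplet with t1 < t2 < t3, in O(T) using prefix/suffix maxima.
import numpy as np

rng = np.random.default_rng(0)
T = 100
initial_score, action_score, end_score = rng.random((3, T))

best_initial_before = np.maximum.accumulate(initial_score)      # best initial state up to t
best_end_after = np.maximum.accumulate(end_score[::-1])[::-1]   # best end state from t on

best, best_t2 = -np.inf, None
for t2 in range(1, T - 1):
    total = best_initial_before[t2 - 1] + action_score[t2] + best_end_after[t2 + 1]
    if total > best:
        best, best_t2 = total, t2

t1 = int(np.argmax(initial_score[:best_t2]))
t3 = best_t2 + 1 + int(np.argmax(end_score[best_t2 + 1:]))
print(f"initial state at frame {t1}, action at {best_t2}, end state at {t3}")
```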